Importing data from flat files with utils
# Load the knitr and kableExtra packages
library(knitr)
library(kableExtra)
options(knitr.table.format = "html")
# Load the gapminder package
library(gapminder)
# Load the dpylr package
library(dplyr)
# Load the ggplot2 package as well
library(ggplot2)
theme_set(theme_bw()) # pre-set the bw theme.Introduction & read.csv
read.csv
The utils package, which is automatically loaded in your R session on startup, can import CSV files with the read.csv() function.
In this exercise, you’ll be working with swimming_pools.csv; it contains data on swimming pools in Brisbane, Australia (Source: data.gov.au). The file contains the column names in the first row. It uses a comma to separate values within rows.
Type dir() in the console to list the files in your working directory. You’ll see that it contains swimming_pools.csv, so you can start straight away.
dir()## [1] "ch1 - Flat files with utils.rmd" "ch1.pdf"
## [3] "ch1_-_Flat_files_with_utils.html" "ch1_-_Flat_files_with_utils.rmd"
## [5] "ch2 - readr and data.table.rmd" "ch2.pdf"
## [7] "ch3 - Excel data.rmd" "ch3.pdf"
## [9] "ch4 - XLConnect.rmd" "ch4.pdf"
## [11] "renamed.xlsx" "summary.xlsx"
path <- file.path("..", "xDatasets", "swimming_pools.csv")
# Import swimming_pools.csv: pools
pools <- read.csv(path)
# Print the structure of pools
str(pools)## 'data.frame': 20 obs. of 4 variables:
## $ Name : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
## $ Address : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
## $ Latitude : num -27.6 -27.6 -27.6 -27.5 -27.4 ...
## $ Longitude: num 153 153 153 153 153 ...
stringsAsFactors
With stringsAsFactors, you can tell R whether it should convert strings in the flat file to factors.
For all importing functions in the utils package, this argument is TRUE, which means that you import strings as factors. This only makes sense if the strings you import represent categorical variables in R. If you set stringsAsFactors to FALSE, the data frame columns corresponding to strings in your text file will be character.
You’ll again be working with the swimming_pools.csv file. It contains two columns (Name and Address), which shouldn’t be factors.
path <- file.path("..", "xDatasets", "swimming_pools.csv")
# Import swimming_pools.csv correctly: pools
pools <- read.csv(path, stringsAsFactors = FALSE)
# Check the structure of pools
str(pools)## 'data.frame': 20 obs. of 4 variables:
## $ Name : chr "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
## $ Address : chr "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
## $ Latitude : num -27.6 -27.6 -27.6 -27.5 -27.4 ...
## $ Longitude: num 153 153 153 153 153 ...
read.delim & read.table
read.delim
Aside from .csv files, there are also the .txt files which are basically text files. You can import these functions with read.delim(). By default, it sets the sep argument to "\t" (fields in a record are delimited by tabs) and the header argument to TRUE (the first row contains the field names).
In this exercise, you will import hotdogs.txt, containing information on sodium and calorie levels in different hotdogs (Source: UCLA). The dataset has 3 variables, but the variable names are not available in the first line of the file. The file uses tabs as field separators.
path <- file.path("..", "xDatasets", "hotdogs.txt")
# Import hotdogs.txt: hotdogs
hotdogs <- read.delim(path,
header = FALSE,
col.names = c("type", "calories", "sodium"))
# Summarize hotdogs
sum_hotdogs <- as.data.frame(do.call(cbind, lapply(hotdogs[,2:3], summary)))
# Round to 2 decimals
sum_hotdogs <- round(sum_hotdogs,2)
sum_hotdogs %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| calories | sodium | |
|---|---|---|
| Min. | 86.00 | 144.00 |
| 1st Qu. | 132.00 | 362.50 |
| Median | 145.00 | 405.00 |
| Mean | 145.44 | 424.83 |
| 3rd Qu. | 172.75 | 503.50 |
| Max. | 195.00 | 645.00 |
read.table
If you’re dealing with more exotic flat file formats, you’ll want to use read.table(). It’s the most basic importing function; you can specify tons of different arguments in this function. Unlike read.csv() and read.delim(), the header argument defaults to FALSE and the sep argument is "" by default.
Up to you again! The data is still hotdogs.txt. It has no column names in the first row, and the field separators are tabs. This time, though, the file is in the data folder inside your current working directory. A variable path with the location of this file is already coded for you.
# Path to the hotdogs.txt file: path
path <- file.path("..", "xDatasets", "hotdogs.txt")
# Import the hotdogs.txt file: hotdogs
hotdogs <- read.table(path,
sep = "",
col.names = c("type", "calories", "sodium"))
# Call head() on hotdogs
hotdogs %>%
head %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| type | calories | sodium |
|---|---|---|
| Beef | 186 | 495 |
| Beef | 181 | 477 |
| Beef | 176 | 425 |
| Beef | 149 | 322 |
| Beef | 184 | 482 |
| Beef | 190 | 587 |
Arguments
Lily and Tom are having an argument because they want to share a hot dog but they can’t seem to agree on which one to choose. After some time, they simply decide that they will have one each. Lily wants to have the one with the fewest calories while Tom wants to have the one with the most sodium.
Next to calories and sodium, the hotdogs have one more variable: type. This can be one of three things: Beef, Meat, or Poultry, so a categorical variable: a factor is fine.
# Finish the read.delim() call
hotdogs <- read.delim(path, header = FALSE, col.names = c("type", "calories", "sodium"))
# Select the hot dog with the least calories: lily
lily <- hotdogs[which.min(hotdogs$calories), ]
# Select the observation with the most sodium: tom
tom <- hotdogs[which.max(hotdogs$sodium), ]
# Print lily and tom
lily %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| type | calories | sodium | |
|---|---|---|---|
| 50 | Poultry | 86 | 358 |
tom %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| type | calories | sodium | |
|---|---|---|---|
| 15 | Beef | 190 | 645 |
Column classes
Next to column names, you can also specify the column types or column classes of the resulting data frame. You can do this by setting the colClasses argument to a vector of strings representing classes:
read.delim("my_file.txt",
colClasses = c("character",
"numeric",
"logical"))This approach can be useful if you have some columns that should be factors and others that should be characters. You don’t have to bother with stringsAsFactors anymore; just state for each column what the class should be.
If a column is set to "NULL" in the colClasses vector, this column will be skipped and will not be loaded into the data frame.
# Previous call to import hotdogs.txt
hotdogs <- read.delim(path, header = FALSE, col.names = c("type", "calories", "sodium"))
# Display structure of hotdogs
str(hotdogs)## 'data.frame': 54 obs. of 3 variables:
## $ type : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ calories: int 186 181 176 149 184 190 158 139 175 148 ...
## $ sodium : int 495 477 425 322 482 587 370 322 479 375 ...
# Edit the colClasses argument to import the data correctly: hotdogs2
hotdogs2 <- read.delim(path, header = FALSE,
col.names = c("type", "calories", "sodium"),
colClasses = c("factor", "NULL", "numeric"))
# Display structure of hotdogs2
str(hotdogs2)## 'data.frame': 54 obs. of 2 variables:
## $ type : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ sodium: num 495 477 425 322 482 587 370 322 479 375 ...